Matrix factorization routines on heterogeneous architectures
Abstract
In this work we consider a method for parallelizing matrix factorization algorithms on systems with Intel® Xeon Phi™ coprocessors. We provide performance results of matrix factorization routines that implement this approach and are available in the Intel® Math Kernel Library (Intel MKL) on the Intel® Xeon® processor line with Intel Xeon Phi coprocessors.

Summary

New heterogeneous systems consisting of a multicore CPU with coprocessors introduce new challenges to designing efficient parallel algorithms. Simultaneous use of all computational resources of such a system for solving one large problem requires uneven distribution of data and computations, which leads to more complex parallelization methods. In this work we present a new parallelization method for matrix factorization that efficiently utilizes all computational resources of a heterogeneous system consisting of a multicore CPU with coprocessors. We show how this method can be applied to parallelize the key linear algebra factorization algorithms (QR, LU, and Cholesky) on systems with Intel Xeon Phi coprocessors.

Our matrix factorization method is based on the panel factorization approach [5]. The panel factorization approach has advantages over communication-avoiding methods, tile methods [5], and their combination [8]:
• no additional computational cost;
• no additional memory consumption.

The panel factorization approach has the same computational cost and memory usage as the classic LAPACK algorithms [4], which makes it preferable for systems with coprocessors. The implementation preserves the standard LAPACK interfaces and data layout, and the algorithm can be applied to any matrices. The implementation of our method is DAG-based [6] and uses panel factorization kernels that were redesigned and rewritten for the new Intel Xeon Phi products [7, 9]. Figure 1 shows the algorithm represented as a DAG.

Figure 1: Algorithm represented as a DAG
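To make the panel-factorize / trailing-update structure that the DAG in Figure 1 encodes concrete, here is a minimal sequential sketch of a blocked (right-looking) Cholesky factorization in plain NumPy. This is an illustration of the general blocked structure, not Intel MKL's implementation; the function name and block size are ours. Each "panel" task must complete before the "update" tasks that read it, which is exactly the dependency structure the DAG captures:

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Right-looking blocked Cholesky (lower triangular), illustrating the
    panel-factorize / trailing-update stages of the DAG. In the paper's
    scheme, panel stages run on the CPU while the large trailing updates
    are candidates for offload to coprocessors."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # --- panel factorization stage: factor the diagonal block ---
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])
        if e < n:
            # Triangular solve forming the rest of the panel column.
            A[e:, k:e] = A[e:, k:e] @ np.linalg.inv(A[k:e, k:e]).T
            # --- update stage: rank-nb update of the trailing submatrix ---
            A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)
```

The update stage dominates the flop count and exposes abundant data parallelism, which is why it is the natural part to run on the coprocessors.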
The algorithm implementation has the following features:
• At the beginning, CPUs produce a number of factorized panels and send to the coprocessors as many panels as needed to maximize coprocessor utilization.
• Coprocessors perform "update" stages in parallel.
• CPUs perform both "factorization" stages and "update" stages in parallel.
• To achieve the best load balance, a coprocessor may send a panel back to the CPU side at any processing stage.

The proposed method provides a high degree of parallelism while minimizing synchronization and communication. The algorithm enables adaptive workload distribution between CPUs and coprocessors to improve load balancing, namely:
• Adaptive data/task distribution on the fly between CPUs and coprocessors.
• No limit on the number of coprocessors in heterogeneous systems.
• Scalability: a system with CPUs and one coprocessor shows a 3x performance improvement, and a system with CPUs and two coprocessors shows a 5x performance improvement.
• No algorithmic limitations on matrix sizes.

Our algorithm is implemented within the framework of the Intel MKL [3] LU, QR, and Cholesky factorization routines. The implemented routines detect the presence of Intel Xeon Phi coprocessors and automatically offload the computations that benefit from the additional computational resources. This usage model hides the complexity of heterogeneous systems from the user, providing ease of use and the same API as the usual Intel MKL routines. This parallelization method can be effectively applied to other LAPACK [4] algorithms.

1. REFERENCES
[1] Netlib. A collection of mathematical software, papers, and databases. http://www.netlib.org.
[2] Jakub Kurzak and Jack Dongarra. Implementing Linear Algebra Routines on Multi-Core Processors with Pipelining and a Look Ahead. UT-CS-06-581, September 2006. LAPACK Working Note #178.
[3] Intel® Math Kernel Library. http://www.intel.com/software/products/mkl.
[4] LAPACK: Linear Algebra PACKage. http://www.netlib.org/lapack.
[5] Sergey V. Kuznetsov. An approach of the QR factorization for tall-and-skinny matrices on multicore platforms. In Proceedings of PARA 2012: 11th International Conference, Helsinki, Finland, LNCS, vol. 7782, Springer Verlag, pp. 235-249, 2013.
[6] A. Kobotov and S. V. Kuznetsov. Efficient dynamic parallelization for the QR factorization. In Proceedings of PARA 2008: 9th International Workshop on State-of-the-Art in Scientific and Parallel Computing, 2008.
[7] Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander V. Kobotov, Roman S. Dubtsov, Greg Henry, Aniruddha G. Shet, George Chrysos, and Pradeep Dubey. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi™ Coprocessor. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2013.
[8] E. Agullo, C. Augonnet, J. Dongarra, M. Faverge, H. Ltaief, S. Thibault, and S. Tomov. QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators. IPDPS 2011.
[9] Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, and Pradeep Dubey. Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core. International Supercomputing Conference, 2011.
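As a toy illustration of the adaptive on-the-fly task distribution described in the summary (this is not Intel MKL's actual scheduler; the device names and per-task costs below are made up), an event-driven sketch can show how "update" tasks naturally flow to whichever device frees up first:

```python
import heapq

def simulate_offload(n_panels, devices):
    """Toy event-driven simulation of adaptive work distribution: at each
    panel step, every trailing 'update' task is handed to whichever device
    (CPU or coprocessor) becomes free earliest. `devices` maps a device
    name to its hypothetical per-update cost. Returns per-device counts."""
    free_at = [(0.0, name) for name in devices]   # (time device is free, name)
    heapq.heapify(free_at)
    counts = {name: 0 for name in devices}
    for k in range(n_panels):
        # One update task per panel remaining after this one.
        for _ in range(n_panels - k - 1):
            t, name = heapq.heappop(free_at)      # earliest-free device wins
            counts[name] += 1
            heapq.heappush(free_at, (t + devices[name], name))
    return counts
```

For example, `simulate_offload(8, {"cpu": 1.0, "phi0": 0.4, "phi1": 0.4})` lets the two faster (hypothetical) coprocessors absorb proportionally more of the 28 update tasks, with no static split decided in advance, which is the essence of the adaptive load balancing above.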